Exercise 10.1

The three big questions for evaluating a model are: 1)Is the model fair? Always ask this question and evaluate by questioning how data was collected, by whom and for what, and how might data affect individuals/communities.

  1. How wrong is the model? We can use pp-check() to see if central tendency and spread of model is similar to data.

  2. How accurate are the posterior predictive models? We can use posterior predictive summaries to assess accuracy.

Exercise 10.2

  1. Unfairness in data collection - Only collecting data on political preferences from people who work on Capitol Hill.

  2. Unfairness in purpose of data collection - Using a model to determine placement in AP classes for incoming high school students (thinking of tracking and how many students and parents don’t know when and how a student is placed in a certain tack that affects post secondary outcomes; I only learned about my track/tracking in general in a coincidental conversation with admin in early high school. More on tracking: https://www.nassp.org/tracking-and-ability-grouping-in-middle-level-and-high-schools/)

  3. Unfairness in the impact of analysis on society - using a model to predict housing prices that overestimates prices and raises housing costs (think of Zillow scandal)

  4. Bias baked into the analysis - collecting data on demographics (race, BMI, gender) to create a model of survivorship on different medications to assign treatment to future patients based on demographics. Such a model bakes in racism, sexism, and fatphobia that affects survivorship.

Exercise 10.3: Perspective

  1. My perspective is influenced by being a 1.5 generation immigrant and Black/African woman living in the South.

  2. There are many American subcultures for which I don’t have a complete or near-complete picture of. I also don’t have an inside view of how men socialize.

  3. I have more insight on the immigration process than those who’ve never gone through it and living in the South gives me a truer sense of Southern cultures than Northeastern media simplifications.

Exercise 10.4

  1. I would tell the colleague they were socialized and educated in a society experiencing coloniality that tries to naturalize various -isms and science and data analysis is not immune to this hegemony and that it can be argued that data science helps perpetuate it, so therefore they’re not neutral. I could urge the colleague to check the three big questions at minimum and more questions like am I measuring race or racism, woman-dependent outcomes or sexism, etc.

  2. I would respond that the model, unless efforts have been made to check for and remove bias in its data (garbage data in, garbage model out), data structure, and the premise and purpose of what is being modeled, the model will have bias baked in.

  3. I once worked as a research assistant for a global health study exploring how youth understand HIV/AIDS transmission and we worked with narratives.I was able to provide some additional context to the research team about some content in the narratives from Kenyan youth; I know someone who grew up in Kenya would have probably given even better context, but even raised abroad I was able to give some insight.

Exercise 10.5

This quote means that models are ideal representations of a real phenomena so they don’t capture all the nuances. Models that are less wrong than others give useful approximations that give a good sense of summary statistics that may be close to actual parameters.

Exercise 10.6

The three assumptions of the Normal Bayesian linear regression model are: 1) Conditioned on X, the observed data Y on case i is independent of the observed data on any other case j.

  1. The typical Y outcome can be written as a linear function X, mu = B0 + B1X.

  2. At any X value, Y varies normally around mu with consistent variability sigma.

#loading libraries for exercises
library(bayesrules)
library(tidyverse) 
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(bayesplot)
## This is bayesplot version 1.8.1
## - Online documentation and vignettes at mc-stan.org/bayesplot
## - bayesplot theme set to bayesplot::theme_default()
##    * Does _not_ affect other ggplot2 plots
##    * See ?bayesplot_theme_set for details on theme setting
library(rstanarm)
## Loading required package: Rcpp
## This is rstanarm version 2.21.1
## - See https://mc-stan.org/rstanarm/articles/priors for changes to default priors!
## - Default priors may change, so it's safest to specify priors, even if equivalent to the defaults.
## - For execution on a local, multicore CPU with excess RAM we recommend calling
##   options(mc.cores = parallel::detectCores())
library(rstan)
## Loading required package: StanHeaders
## rstan (Version 2.21.3, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## 
## Attaching package: 'rstan'
## The following object is masked from 'package:tidyr':
## 
##     extract
library(tidybayes)
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(broom.mixed)

Exercise 10.7

  1. I would use the values of B_0, B_1 and sigma and sensible priors to create a Bayesian linear regression model. I can view the data to make informed choices about priors. Then I would use the model to predict more values of y and compare these to values of y in the dataset to assess if I have a good model.

See below.

#creating vectors 
x <- c(12,10,4,8,6)
y <- c(20,17,4,11,9)
line_data <- data.frame(x,y)
line_data #looking at data.frame to get a sense of the tendency and spread 
##    x  y
## 1 12 20
## 2 10 17
## 3  4  4
## 4  8 11
## 5  6  9
plot(x,y)

#create simple linear regression with the given data
lm(y ~ x, data = line_data) #slope is the 2nd coefficient based on googling and this is just 1 model of these points. 
## 
## Call:
## lm(formula = y ~ x, data = line_data)
## 
## Coefficients:
## (Intercept)            x  
##        -3.8          2.0
line_model <- stan_glm(y ~ x, data = line_data, #create simulation with given parameter set 
                       family = gaussian,
                       prior_intercept = normal(-1.8),
                       prior = normal(2.1, .8), 
                       prior_aux = exponential(1.25), #arrived at 1.25 wiht 1/.8
                       chains = 4, iter = 5000*2, seed = 80)
beta_0 <- -1.8 #these values given as first parameter set 
beta_1 <- 2.1
sigma  <- .8
set.seed(678)
y_simulation <- line_data |> 
  mutate(mu = beta_0 + beta_1 * x,
         simulated_y = rnorm(5, mean = mu, sd = sigma)) |> #we have 5 data points
  select(x, y, simulated_y)

head(y_simulation, 5) #Compare data to simulation.Part b) The predictions for Y1:Y5 are shown in the column labeled simulated_y.
##    x  y simulated_y
## 1 12 20   22.781211
## 2 10 17   19.946564
## 3  4  4    6.973068
## 4  8 11   14.132278
## 5  6  9    9.075090

Alternatively could have done this more simply with vectors and this would have produced the estimates Steve showed in class. These estimates are similar to the ones above. The ones above are within 1 or 2 sigma of the below estimates.

y_estimates <- beta_0 + beta_1*x
y_estimates #y estimates with N~(-1.8 + 2.1x_i, 0.8)
## [1] 23.4 19.2  6.6 15.0 10.8
#comparing data to simulated y's visually 
ggplot(y_simulation, aes(x = simulated_y)) + 
  geom_density(color = "red") + 
  geom_density(aes(x = y), color = "darkblue")

#simulation and data follow similar rises and falls. I feel comfortable generating more values based on this. For the majority of values the simulation is an underestimate. 
# Examine 80 simulated samples
pp_check(line_model, nreps = 80) + 
  xlab("lines")

#This plot shows that most of the lines capture the range and central tendency of our limited data set.

Exercise 10.8

  1. The goal of a posterior predictive check is to check whether a model is able to simulate data that is similar to the data used to create the model. The posterior predictive check gives insight to whether the assumptions that a Y outcome can be written as a linear function and that at any X value, Y varies normally around mu with consistent variability sigma.

  2. Posterior predictive checks can be analyzed by comparing the simulated datasets to the observed dataset and looking for similarities in the spread and typical values of the simulated and observed data. If a model has a significantly different mean, range, and shape then the observed data, we know we have a bad model.

Exercise 10.9

  1. The median absolute error (MAE) tells us the typical difference between observed Y_i values and the Y_i’ means from the model. It helps us judge quality of the model.

  2. The scaled median absolute error measures the typical number of standard deviations that observed Y_i differs from the posterior predictive Y_i’ means. It can be an improvement over MAE without a scale because it standardizes the difference which eases comparisons between models that may have different portions of the data we’ve used to train and test.

  3. The within-50 statistic tells us what proportion of observed Y_i that falls within 50% of the posterior prediction interval. If this number is really low we know there’s something off about the model.

Exercise 10.10

  1. In pp_check() the darker density is the actual observed data. The light blue density represents simulated data from a model.

  2. If model fits well the light blue density will have similar central tendency or tendencies and spread as the dark blue density. A good fitting model will produce a plot like this because the model has the same data structure as the observed data and produces predictions mostly within 2 or 3 standard deviations of the observed value.

  3. If a model fits poorly, it may have a different shape, different central tendency, and different range (either much more less) than the observed data.

Exercise 10.11

  1. The data are the taco recipes.
  2. The model is Reem. Tacos with different recipes are the input to Reem and the output is some review.
  3. Cross validation could be done by splitting the data into different subsets. In this case separating the the anchovie-based recipes and non-anchovie-based recipes and testing Reem on their flavors. We could give her 90% of the tacos without the anchovies for model training and then test with the anchovies or some other type of split.
  4. Cross-validation would help identify which recipe combinations are the tastiest and could help point me in the direction of the range of taco flavors to offer. I also think there could be an argument of adding more models so more people besides Reem has tasted the tacos and compare cross-validation estimates within and between the models or people in this case.

Exercise 10.12

  1. The fours steps for the k-fold cross validation algorithm are:
  1. Create folds. Let k be some integer from 2 to sample size n. Split the data into k non-overlapping folds (subsets) of roughly equal size.
  2. Train and test the model by training with combined data in the first k-1 folds and testing the model with k^th data fold. Measure prediction quality by MAE or other way.
  3. Repeat step two k-1 times and each time leave out a different fold for testing.
  4. Average the k measures from steps 2 and 3 to obtain 1 cross-validation estimate of prediction quality.
  1. When you use the same data to train and test a model, the model has been optimized to capture features in the data it was built with so it produces an overly optimistic predictive accuracy.

  2. My questions are how to generate how to split folds in R and in what sociological context would we use leave-one-out.

Exercise 10.13

Load data needed for exercises.

library(bayesrules)
data("coffee_ratings")
coffee_ratings <- coffee_ratings |> 
  select(farm_name, total_cup_points, aroma, aftertaste)
view(coffee_ratings)
  1. Use head to look at subsample of coffee_ratings.
head(coffee_ratings, 15)
## # A tibble: 15 × 4
##    farm_name                                   total_cup_points aroma aftertaste
##    <fct>                                                  <dbl> <dbl>      <dbl>
##  1 "metad plc"                                             90.6  8.67       8.67
##  2 "metad plc"                                             89.9  8.75       8.5 
##  3 "san marcos barrancas \"san cristobal cuch"             89.8  8.42       8.42
##  4 "yidnekachew dabessa coffee plantation"                 89    8.17       8.42
##  5 "metad plc"                                             88.8  8.25       8.25
##  6  <NA>                                                   88.8  8.58       8.42
##  7  <NA>                                                   88.8  8.42       8.33
##  8 "aolme"                                                 88.7  8.25       8.5 
##  9 "aolme"                                                 88.4  8.67       8.58
## 10 "tulla coffee farm"                                     88.2  8.08       8.5 
## 11 "fahem coffee plantation"                               88.1  8.17       8.25
## 12 "el filo"                                               87.9  8.25       8.17
## 13 "los cedros"                                            87.9  8.08       8.33
## 14 "arianna farms"                                         87.9  8.33       8.08
## 15 "aolme"                                                 87.8  8.25       8.5

Modeling ratings by aroma likely violates the independence assumption (conditioned on X, observed data Y_i on case i is independent of the observed data on any other case j) because given X = aroma and Y = total_cup_points a reviewer is likely comparing i cup of coffee to j cup of coffee to make determination on aroma. Similarly with aftertaste as X and Y remaining total_cup_points, a taster may evaluate aftertaste of i depending on aftertaste of j. In other words, if you’ve never tried anything other than Folgers or my favorite, Nescafe Clasico, then you won’t know if it has better or worse aroma or aftertaste than another brand.

  1. Take one observation per farm and name this subsample as new_coffee for the other excercises.
set.seed(84735)
new_coffee <- coffee_ratings |> 
  group_by(farm_name) |> #telling R how to group data for new set
  sample_n(1) |> #telling R to take 1 sample from each group 
  ungroup()
dim(new_coffee) #new_coffee has 572 observations with 4 variables 
## [1] 572   4

Exercise 10.14 (Coffee ratings: model it)

lm(total_cup_points ~ aroma, data = new_coffee) #one estimate of parameters for a well fitting line; we see slope estimate (6.163) is positive 
## 
## Call:
## lm(formula = total_cup_points ~ aroma, data = new_coffee)
## 
## Coefficients:
## (Intercept)        aroma  
##      35.397        6.163
ggplot(new_coffee, aes(y = total_cup_points, x = aroma)) + #plotting aroma on x axis and points on y axist  
  geom_point(size = 0.5) +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

The relationship between aroma and coffee rating is positive. As aroma grade increases total_cup_points increases. One model indicates that for each unit increase in aroma, cup points increase by about 6 units.

  1. Could also use stan, but in b we’re asked to use stan_glm to model.
coffee_model <- stan_glm(total_cup_points ~ aroma, data = new_coffee,
                       family = gaussian,
                       prior_intercept = normal(75, 10), #used range rule for sd and 75 was given 
                      prior = normal(7.5, .2), #7.5 seems like a reasonable mu for B_1 based on scanning the data
                       prior_aux = exponential(0.1),#found l by 1/10
                       chains = 4, iter = 5000*2, seed = 84735)
#Store 4 chains for each parameter in 1 data frame as we might need this later. 
coffee_model_df <- as.data.frame(coffee_model)
  1. For c I’ll use tidy() to for a posterior summary of aroma.
#in this step I ahve a numerical psterior summary for aroma 
tidy(coffee_model, effects = c("fixed", "aux"),
     conf.int = TRUE, conf.level = 0.8)
## # A tibble: 4 × 5
##   term        estimate std.error conf.low conf.high
##   <chr>          <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)    29.0     1.20      27.4      30.5 
## 2 aroma           7.01    0.158      6.81      7.21
## 3 sigma           1.98    0.0590     1.90      2.05
## 4 mean_PPD       82.1     0.117     81.9      82.2
# 50 simulated model lines to provide visual summary of B_1 coefficient aroma
new_coffee |> 
  add_fitted_draws(coffee_model, n = 50) |> 
  ggplot(aes(x = aroma, y = total_cup_points)) +
    geom_line(aes(y = .value, group = .draw), alpha = 0.7, color = "grey") + 
    geom_point(data = new_coffee, size = 0.05, color = "blue")
## Warning: `fitted_draws` and `add_fitted_draws` are deprecated as their names were confusing.
## Use [add_]epred_draws() to get the expectation of the posterior predictive.
## Use [add_]linpred_draws() to get the distribution of the linear predictor.
## For example, you used [add_]fitted_draws(..., scale = "response"), which
## means you most likely want [add_]epred_draws(...).

#plot shows 50 posterior plausible sets of B_0 and B_1 (aroma). Notice that aroma is less plausible below 7.0.
  1. The posterior median relationship is 28.975120 + 7.010832X meaning that for every one unit increase in aroma we expect total_cup_points to increase by 7.010832.

  2. I would say that yes I do based on my plots in a and c that show a positive correlation between aroma and cup score. The posterior summary of aroma shows that the 80% credible interval is above zero (6.8 - 7.2).

Exercise 10.15

#finding first set of the posterior plausible parameter sets (B_0, B_1, sigma)
first_set <- head(coffee_model_df, 1)
first_set
##   (Intercept)    aroma    sigma
## 1    27.27295 7.241494 1.909072
#from observed data on each of the 572 observations in the new_coffee data we simulate a total_cup_points outcome from the normal data model tuned to this first parameter set 

beta_0 <- first_set$`(Intercept)`#remember that beta_0 is intercept
beta_1 <- first_set$aroma #x values 
sigma <- first_set$sigma

set.seed(84735)
one_simulation <- new_coffee |> 
  mutate(mu = beta_0 + beta_1 * aroma, #this is linear equation with data from the first set
         simulated_points = rnorm(572, mean = mu, sd = sigma)) |> #remember 572 is sample size 
  select(aroma, total_cup_points, simulated_points)
  1. Next compare several of the data total_cup_points to the simulated points and then make density plot.
head(one_simulation, 3)
## # A tibble: 3 × 3
##   aroma total_cup_points simulated_points
##   <dbl>            <dbl>            <dbl>
## 1  7.67             84               84.1
## 2  7.33             76.2             80.1
## 3  6.75             67.9             77.6

Density plot.

ggplot(one_simulation, aes(x = simulated_points)) +
  geom_density(color = "lightblue") +
  geom_density(aes(x = total_cup_points), color = "darkblue")

  1. Use pp_check()
#examine 100 of the 20000 simulated points
pp_check(coffee_model, nreps = 100) +
  xlab("points")

  1. The 100 simulated lines capture the typical points (somewhere above 80) and the range. Given this, I would say that assumption 2 (that the outcome can be written as a linear function) and assumption 3 (at any X values, Y varies normally around mu with consistent variability sigma) are reasonable.

Exercise 10.16

  1. Without using posterior_predict(), simulate and plot a posterior predictive model for the rating of this batch.
new_coffee |> 
  filter(farm_name == "-") |> #the first batch only had a hyphen for a name but this filter still worked. I could have also just looked at the data to see the aroma and points since these values are first in the list 
  select(aroma, total_cup_points)
## # A tibble: 1 × 2
##   aroma total_cup_points
##   <dbl>            <dbl>
## 1  7.67               84
#simulate posterior predictive model for first farm.

set.seed(84735)
predict_firstfarm <- coffee_model_df |> 
  mutate(mu = `(Intercept)` + aroma*7.67,
         point_new = rnorm(20000, mean = mu, sd = sigma))

#plot this model
ggplot(predict_firstfarm, aes(x = point_new)) + 
  geom_density() + 
  geom_vline(xintercept = 84, color = "lightgreen")  #added this line to show where score for first farm in observed data = 84 

  1. Without using prediction_summary() calculate and interpret raw & standardized posterior predictive error.
#raw predictive error 
predict_firstfarm |> 
  summarize(mean = mean(point_new), error = 84 - mean(point_new))
##       mean    error
## 1 82.74027 1.259727

The model under-predicted the points by 1.259727.

#standardized posterior error 
predict_firstfarm |>  
  summarize(sd = sd(point_new), error = 84 - mean(point_new),
            error_scaled = error / sd(point_new))
##        sd    error error_scaled
## 1 1.97503 1.259727    0.6378268

The score of 84 we observed in the new_coffee data is 1.97503 standard deviations above the mean prediction.

  1. construct ppc_intervals() plot.
#in this case we'll use posterior_predict 
coffpredictions <- posterior_predict(coffee_model, newdata = new_coffee)
dim(coffpredictions)
## [1] 20000   572
ppc_intervals(new_coffee$total_cup_points, yrep = coffpredictions, x = new_coffee$aroma,  prob = 0.5, prob_outer = 0.95, size = 0.8) #adjusted size to make blue lines easier to see 

For each of the 572 total_cup_points, the 95% prediction interval is displayed as the narrow long blue bars and the 50% prediction interval is show in the wider short blue bars. The posterior predictive median is the light blue dot.What we are seeing is the center and spread of each posterior predictive model which informs us of how compatible each model is with the observed points (dark blue dots). Visually it looks like few of the observed points are outside of the 50% or 95% prediction intervals which means I can have confidence in the accuracy of the model.

  1. how many batches have ratings that are within 50% posterior prediction interval.
set.seed(84735)
prediction_summary(coffee_model, data = new_coffee)
##         mae mae_scaled within_50 within_95
## 1 0.9414586  0.4794058 0.6520979  0.958042

65.20979% of batches have ratings that are within their 50% posterior prediction interval.

Exercise 10.17

set.seed(84735)

cv_procedurecoff <- prediction_summary_cv(
  model = coffee_model, data = new_coffee, k = 10) #used 10 because of Goldilocks challenge
cv_procedurecoff
## $folds
##    fold       mae mae_scaled within_50 within_95
## 1     1 0.9835034  0.4906481 0.6034483 0.9310345
## 2     2 1.0054881  0.5200194 0.5263158 0.9298246
## 3     3 0.9521365  0.4646638 0.7192982 0.9649123
## 4     4 1.0410444  0.5109289 0.6140351 0.9824561
## 5     5 0.9363774  0.5297648 0.6315789 0.9473684
## 6     6 0.9295120  0.4599348 0.7368421 0.9824561
## 7     7 0.7660163  0.3831139 0.6842105 0.9649123
## 8     8 1.0092968  0.5084666 0.6140351 0.9298246
## 9     9 0.8018329  0.3974392 0.6842105 0.9824561
## 10   10 0.6138494  0.3039137 0.6379310 0.9655172
## 
## $cv
##         mae mae_scaled within_50 within_95
## 1 0.9039057  0.4568893 0.6451906 0.9580762
  1. Median absolute error (mae) tells us that the typical difference between observed points and posterior predictive mean is .903. The scaled MAE tells us the typical number of standard deviations that the observed points fall from their posterior predictive means is .456 standard deviations. 64.52% of the observed points fall within 50% of the posterior prediction interval. 95.81% of the observed points fall within the 95% posterior prediction interval.

  2. Looking at the MAE among the folds we see that it ranged from being as low as 0.613 to as high as 1.04. Looking at the variation among the four cross-validated metrics shows how some training models perform better on some sets than others. The within_50 metric was as low as 0.526 for one fold. The metrics found from the average of these 10 folds seems to represent the trend of the folds well.

Exercise 10.18

  1. To assess if it’s fair we should answer how the data was collected, by whom and for what, and how it might impact individuals and societies. Using ?coffee_ratings we see that the data was collected by James leDeoux, a data scientist, and shared through R for DataScience, a learning initiative. The data doesn’t have protected personal information. It probably won’t impact society or individuals poorly except for those farms rated the least were people to use this to change their coffee consumption choices. Overall, I’d say the model is fair. Exercise 10.17 supports that the model has useful accuracy.

Exercise 10.19

  1. Before using stang_glm() used visualizations to inform a prior.
lm(total_cup_points ~ aftertaste, data = new_coffee) #one estimate of parameters for a well fitting line; we see slope estimate (6.623) is positive 
## 
## Call:
## lm(formula = total_cup_points ~ aftertaste, data = new_coffee)
## 
## Coefficients:
## (Intercept)   aftertaste  
##      33.121        6.623
ggplot(new_coffee, aes(y = total_cup_points, x = aftertaste)) + #plotting aroma on x axis and points on y axis  
  geom_point(size = 0.5) +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

aftertaste_model <- stan_glm(total_cup_points ~ aftertaste, data = new_coffee,
                       family = gaussian,
                       prior_intercept = normal(75, 10), #kept same prior for rating from earlier exercise 
                      prior = normal(7.4, .3), #7.4 seems like a reasonable mu for B_1 based on second visualization for this problem .3 is an estimate of the range rule vale 
                       prior_aux = exponential(0.1),#found l by 1/10
                       chains = 4, iter = 5000*2, seed = 84735)
  1. Plot to see if model is wrong.
pp_check(aftertaste_model, nreps = 100) +
  xlab("total_cup_points")

#finding 10-fold cross validation of the posterior predictive quality 
set.seed(84735)

cv_procedureaf <- prediction_summary_cv(
  model = aftertaste_model, data = new_coffee, k = 10) 
cv_procedureaf
## $folds
##    fold       mae mae_scaled within_50 within_95
## 1     1 0.5291000  0.3272169 0.7758621 0.9827586
## 2     2 0.7338549  0.4742131 0.7017544 0.9473684
## 3     3 0.7298899  0.4483136 0.7192982 1.0000000
## 4     4 0.5269603  0.3262044 0.8596491 0.9649123
## 5     5 0.5482930  0.3938378 0.7017544 0.9473684
## 6     6 0.6901362  0.4327767 0.6315789 0.9824561
## 7     7 0.8247415  0.5229725 0.7017544 0.9298246
## 8     8 0.7082116  0.4511447 0.6666667 0.9649123
## 9     9 0.9796816  0.6103597 0.5789474 0.9824561
## 10   10 0.8477898  0.5358133 0.7241379 0.9655172
## 
## $cv
##         mae mae_scaled within_50 within_95
## 1 0.7118659  0.4522853 0.7061404 0.9667574
  1. Note that the aftertaste MAE of .711 is lower than MAE of model based on aroma (0.90) and a within_50 metric of .706 which is higher than the model based on aroma (.64). The second model also has a higher within_95 metric. Due to these factors, I would pick aftertaste as the better predictor of coffee bean ratings.

Open-ended Exercises

Exercise 10.20

  1. Fit the model using stan_glm()

First I’ll visualize the data to inform some priors.

lm(maxtemp ~ mintemp, data = weather_perth) #the second value is one possible slope 
## 
## Call:
## lm(formula = maxtemp ~ mintemp, data = weather_perth)
## 
## Coefficients:
## (Intercept)      mintemp  
##     14.4681       0.8355
ggplot(weather_perth, aes(y = maxtemp, x = mintemp)) + 
  geom_point(size = 0.5) +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

Next I’ll do stan_glm()

temp_model <- stan_glm(maxtemp ~ mintemp, data = weather_perth,
                       family = gaussian,
                       prior_intercept = normal(25, 11), #points seem to cluster around 25 on maxtemp and used estimate of range rule for sd
                      prior = normal(15, 7.5), #points seem to cluster around 15 on mintemp and used estmate of rangerule for sd
                       prior_aux = exponential(0.09),#found l by 1/11
                       chains = 4, iter = 5000*2, seed = 84735)
#Store 4 chains for each parameter in 1 data frame 
temp_model_df <- as.data.frame(temp_model)

Use tidy() to for a posterior summary of mintemp.

#in this step I ahve a numerical posterior summary for mintemp 
tidy(temp_model, effects = c("fixed", "aux"),
     conf.int = TRUE, conf.level = 0.8)
## # A tibble: 4 × 5
##   term        estimate std.error conf.low conf.high
##   <chr>          <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)   14.5      0.380    14.0      14.9  
## 2 mintemp        0.836    0.0264    0.802     0.870
## 3 sigma          4.30     0.0964    4.18      4.42 
## 4 mean_PPD      25.6      0.192    25.3      25.8

Note credible intervals are above 0.

The posterior median relationship is 14.47 + 0.84X meaning that for every 1 degree increase in the mintemp, we can expect that maxtemp to increase by 0.84 degrees.

  1. Evaluate model and summarize findings.
  1. First I’ll do a pp_check()
pp_check(temp_model, nreps = 50) +
  xlab("maxtemp")

Notice the model doesn’t capture bimodality of the maxtemp that could represent seasonal changes or something else (maybe wildfires?). It captures the range, but has a slightly larger one.

  1. I’ll do ppc_intervals next.
predictions <- posterior_predict(temp_model, newdata = weather_perth)
dim(predictions)
## [1] 20000  1000
ppc_intervals(weather_perth$maxtemp, yrep = predictions, x = weather_perth$mintemp, prob = 0.5, prob_outer = 0.95)

prediction_summary(temp_model, data = weather_perth)
##       mae mae_scaled within_50 within_95
## 1 3.23352  0.7519272     0.458     0.969

Note that 45% of observed maxtemps fall within the 50% posterior prediction interval.

  1. Finally, I’ll cross-validate.
cv_proceduret <- prediction_summary_cv(
  model = temp_model, data = weather_perth, k = 10) 
cv_proceduret
## $folds
##    fold      mae mae_scaled within_50 within_95
## 1     1 3.237666  0.7491671      0.44      1.00
## 2     2 2.667141  0.6121590      0.55      0.97
## 3     3 3.200698  0.7478611      0.47      0.96
## 4     4 3.657437  0.8669770      0.36      0.96
## 5     5 3.389756  0.7879079      0.44      0.95
## 6     6 3.342260  0.7779194      0.41      0.94
## 7     7 3.103401  0.7211185      0.48      0.99
## 8     8 3.062961  0.7127143      0.48      0.98
## 9     9 3.182260  0.7374859      0.46      0.97
## 10   10 3.146443  0.7258124      0.48      0.96
## 
## $cv
##        mae mae_scaled within_50 within_95
## 1 3.199002  0.7439123     0.457     0.968

This model has ok or useful predictive accuracy and captures the generally positive relationship between min temp and max temp. However, we know the model could be improved because of the below 50% within_50 metrics and because it doesn’t capture the bimodality of the sample data.

Exercise 10.21

First I’ll visualize the data to inform some priors.

lm(rides ~ humidity, data = bikes) #the second value is one possible slope 
## 
## Call:
## lm(formula = rides ~ humidity, data = bikes)
## 
## Coefficients:
## (Intercept)     humidity  
##    3933.722       -7.117
ggplot(bikes, aes(y = rides, x = humidity)) + 
  geom_point(size = 0.5) +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

Notice that there’s a weakly negative association between humidity and rides. As humidity increases rides decrease slightly.

Next I’ll do stan_glm()

humid_model <- stan_glm(rides ~ humidity, data = bikes,
                       family = gaussian,
                       prior_intercept = normal(5000, 1000), #using priors from ch. 9 about the average and sd of rides on an average temp day 
                      prior = normal(63, 15), #most of the humidity values are between 50 and 75 so I choose 63 as kind of a midpoint. Used range rule approx. for sd 
                       prior_aux = exponential(0.0008),#used prior from chp. 9
                       chains = 4, iter = 5000*2, seed = 84735)
#Store 4 chains for each parameter in 1 data frame 
humid_model_df <- as.data.frame(humid_model)
# numerical posterior summary for humidity 
tidy(humid_model, effects = c("fixed", "aux"),
     conf.int = TRUE, conf.level = 0.8)
## # A tibble: 4 × 5
##   term        estimate std.error conf.low conf.high
##   <chr>          <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)  3559.      295.    3190.     3933.  
## 2 humidity       -1.06      4.50    -6.75      4.60
## 3 sigma        1576.       50.1   1513.     1643.  
## 4 mean_PPD     3492.      101.    3362.     3620.

The posterior median relationship between riders and humidity is 3558.56 + -1.06X meaning that for each unit increase in humidity, riders decrease by 1.06.

  1. To evaluate my model I will do a pp_check, ppc_intervals, and cross-validation.

#first pp_check
pp_check(humid_model, nreps = 50) +
  xlab("rides")

#ppc_intervals 
humidpredictions <- posterior_predict(humid_model, newdata = bikes)
dim(humidpredictions)
## [1] 20000   500
ppc_intervals(bikes$rides, yrep = humidpredictions, x = bikes$humidity, prob = 0.5, prob_outer = 0.95)

prediction_summary(humid_model, data = bikes)
##        mae mae_scaled within_50 within_95
## 1 1189.441   0.750041     0.462     0.966
#cross-validation
cv_procedureh <- prediction_summary_cv(
  model = humid_model, data = bikes, k = 10) 
cv_procedureh
## $folds
##    fold       mae mae_scaled within_50 within_95
## 1     1  908.4528  0.5703438      0.58      0.96
## 2     2 1159.7020  0.7419827      0.48      0.98
## 3     3 1233.6827  0.7781523      0.44      1.00
## 4     4 1280.5089  0.8075703      0.40      0.98
## 5     5 1242.0071  0.7886023      0.46      0.98
## 6     6  776.6916  0.4883848      0.60      0.94
## 7     7 1077.5036  0.6757442      0.50      0.98
## 8     8 1711.2308  1.0873563      0.34      0.98
## 9     9 1396.8398  0.8888582      0.44      0.90
## 10   10 1373.2332  0.8769758      0.38      0.96
## 
## $cv
##        mae mae_scaled within_50 within_95
## 1 1215.985  0.7703971     0.462     0.966

The ppcheck shows the model captures the range and central tendency mostly well; it doesn’t capture that the observed rides have bimodality. Note that 46.2% of observed rides fall within the 50% posterior prediction interval based on the ppc_intervals function.
The cross validation shows similar within_50 metrics and the mae_scaled shows a low typical number of standard deviations that the observed rides fall from the posterior predictive mean. Overall this model is ok at capturing the weakly negative relationship between humidity and rides. An improved model would capture the bimodality of the observed data. We should keep in mind that a lot of variability in this model comes from the widely dispersed observed data; the ppc_intervals have wide bands that reflect this.